A regular expression is a special string that describes a search pattern. Many of you have surely seen and used them already when typing expressions like ls(or dir) *.txt , to get a list of all the files with the extension txt. Regular expressions are very useful not only for pattern matching, but also for manipulating text. In SRMs regular expressions can be extremely handy. Many problems that require some coding can be written using regular expressions on a few lines, making your life much easier.
A regular expression(regex) is one or more non-empty branches, separated by ‘|’. It matches anything that matches one of the branches. The following regular expression will match any of the three words “the”,”top”,”coder”(quotes for clarity).
In java(1.4 and above) there is a package “java.util.regex” which allows usage of regular expressions.
This package contains three classes : Pattern, Matcher and PatternSyntaxException.
A Pattern object is a compiled representation of a regular expression. The Pattern class provides no public constructors. To create a pattern, you must call one of its public static compile methods, both of which will return a Pattern object.
A Matcher object is the engine that interprets the pattern and performs match operations against an input string. Like the Pattern class, Matcher defines no public constructors. You obtain a Matcher object by calling the public matcher method on a Pattern object.
A PatternSyntaxException object is an unchecked exception that indicates a syntax error in a regular expression pattern.
Example(adapted from[4]):
1 2 3 4 5 6 7 8 9 10 11 12
Pattern pattern; Matcher matcher; pattern = Pattern.compile(<REGEX>); matcher = pattern.matcher(<INPUT>); boolean found; while(matcher.find()) { System.out.println("Found the text \"" + matcher.group() + "\" starting at index " + matcher.start() + " and ending at index " + matcher.end() + "."); found = true; } if(!found){ System.out.println("No match found."); }
Java also offers the following methods in the String class:
boolean matches(String regex) (returns if the current string matches the regular expression regex)
String replaceAll(String regex, String replacement) (Replaces each substring of this string that matches the given regular expression with the given replacement.)
String replaceFirst(String regex, String replacement) (Replaces the first substring of this string that matches the given regular expression with the given replacement.)
String[] split(String regex) (Splits this string around matches of the given regular expression)
Many Topcoders believe that regular expressions are one of Java’s main strengths over C++ in the arena. C++ programmers don’t despair, regular expressions can be used in C++ too.
There are several regular expression parsing libraries available for C++, unfortunately they are not very compatible with each other. Fortunately as a Topcoder in the arena one does not have to cope with all this variety of “not so compatible with one another” libraries. If you plan to use regular expressions in the arena you have to choose between two flavors of regex APIs: POSIX_regex and GNU_regex. To use these APIs the header file “regex.h” must be included. Both of these work in two steps – first there is a function call that compiles the regular expression, and then there is a function call that uses that compiled regular expression to search or match a string.
Here is a short description of both of these APIs, leaving it up to the coder to choose the one that he likes the most.
Includes support for two different regular expression syntaxes, basic and extended. Basic regular expressions are similar to those in ed, while extended regular expressions are more like those in egrep, adding the ‘|’, ‘+’ and ‘?’ operators and not requiring backslashes on parenthesized subexpressions or curly-bracketed bounds. Basic is the default, but extended is prefered.
With POSIX, you can only search for a given regular expression; you can’t match it. To do this, you must first compile it in a pattern buffer, using `regcomp’. Once you have compiled the regular expression into a pattern buffer you can search in a null terminated string using ‘regexec’. If either of the ‘regcomp’ or ‘regexec’ function fail they return an error code. To get an error string corresponding to these codes you must use ‘regerror’. To free the allocated fields of a pattern buffer you must use ‘regfree’.
For an in depth description of how to use these functions please consult [2] or [3] in the References section.
Example:
Here is a small piece of code showing how these functions can be used:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
regex_t reg;
string pattern = “ [ ^ tpr] {
2,
}”;
string str = “topcoder”;
regmatch_t matches[1];
regcomp( & reg, pattern.c_str(), REG_EXTENDED | REG_ICASE);
if (regexec( & reg, str.c_str(), 1, matches, 0) == 0) {
cout << "Match "
cout << str.substr(matches[0].rm_so, matches[0].rm_eo - matches[0].rm_so)
cout << " found starting at: "
cout << matches[0].rm_so
cout << " and ending at "
cout << matches[0].rm_eo
cout << endl;
} else {
cout << “Match not found.”
cout << endl;
}
regfree( & reg);
The GNU_regex API has a richer set of functions. With GNU regex functions you can both match a string with a pattern and search a pattern in a string. The usage of these functions is somehow similar with the usage of the POSIX functions: a pattern must first be compiled with ‘re_compile_pattern’, and the pattern buffer obtained is used to search and match. The functions used for searching and matching are ‘re_search’ and ‘re_match’. In case of searching a fastmap can be used in order to optimize search. Without a fastmap the search algorithm tries to match the pattern at consecutive positions in the string. The fastmap tells the algorithm what the characters are from which a match can start. The fastmap is constructed by calling the ‘re_compile_fastmap’. The GNU_regex also provides the functions ‘re_search2′ and ‘re_match2′ for searching and matching with split data. To free the allocated fields of a pattern buffer you must use ‘regfree’.
For an in-depth description of how to use these functions please consult [3].
Example:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
string pattern = “([a - z] + )
.\\1”;
string str = “top - topcoder | coder”;
re_pattern_buffer buffer;
char map[256];
buffer.translate = 0;
buffer.fastmap = map;
buffer.buffer = 0;
buffer.allocated = 0;
re_set_syntax(RE_SYNTAX_POSIX_EXTENDED);
const char * status = re_compile_pattern(pattern.c_str(), pattern.size(), & buffer);
if (status) {
cout << "Error: " << status << endl;
}
re_compile_fastmap( & buffer);
struct re_registers regs;
int ofs = 0;
if (re_search( & buffer, str.c_str(), str.size(), 0, str.size(), & regs) != -1) {
cout << "Match "
cout << str.substr(regs.start[0], regs.end[0] - regs.start[0])
cout << " found starting at: "
cout << regs.start[0]
cout << " and ending at "
cout << regs.end[0]
cout << endl;
} else {
cout << “Match not found.”
cout << endl;
}
regfree( & buffer);
The following examples are all written in Java for the sake of clarity. A C++ user can use the POSIX or the GNU regex APIs to construct functions similar to those available in Java(replace_all, split, matches).
CyberLine (SRM 187 div 1, level 1)
1
2
3
4
5
6
7
8
9
import java.util.*;
public class Cyberline {
public String lastCyberword(String cyberline) {
String[] w = cyberline.replaceAll("-", "")
.replaceAll("[^a-zA-Z0-9]", " “)
.split(”") ;
return w[w.length - 1];
}
}
UnLinker (SRM 203 div 2, level 3)
1
2
3
4
5
6
7
8
9
10
import java.util.*;
public class UnLinker {
public String clean(String text) {
String[] m = text.split("((http://)?www[.]|http://)[a-zA-Z0-9.]+[.](com|org|edu|info|tv)", -1);
String s = m[0];
for (int i = 1; i < m.length; ++i)
s = s + “OMIT” + i + m[i];
return s;
}
}
CheatCode (SRM 154 div 1, level 1)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
import java.util.;
public class CheatCode {
public int[] matches(String keyPresses, String[] codes) {
boolean[] map = new boolean[codes.length];
int count = 0;
for (int i = 0; i < codes.length; ++i) {
String regex = ".";
for (int j = 0; j < codes[i].length();) {
int k = 1;
while ((j + k) < codes[i].length() && codes[i].charAt(j + k) == codes[i].charAt(j)) k++;
regex = regex + codes[i].charAt(j) + “{
”+k + ",}";
j += k;
}
regex = regex + ".*";
if (keyPresses.matches(regex)) {
map\[i\] = true;
count++;
}
}
int\[\] res = new int\[count\];
int j = 0;
for (int i = 0; i < codes.length; ++i)
if (map\[i\] == true)
res\[j++\] = i;
return res;
}
}
The regex(7) linux manual page
The regex(3) linux manual page
http://docs.freebsd.org/info/regex/regex.info.Programming_with_Regex.html